Crimson Desert: The all-you-can-eat video game divides critics

BBC News

Video game fans and big, blockbuster releases have had an uneasy relationship in recent years. As so-called triple-A games get more expensive to make, the publishers behind them are accused of taking fewer risks and failing to try new things. But highly anticipated new release Crimson Desert asks a different question - what if a big-budget, graphically advanced game tried to do absolutely everything? The ambitious action-adventure has been compared to a buffet, presenting players with a smorgasbord of ideas, gameplay styles and quests to gorge on. While some have praised it as a feast, others have found it overstuffed, with some undercooked morsels behind the impressive presentation.








IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation

Neural Information Processing Systems

As Large Language Models (LLMs) become more capable of handling increasingly complex tasks, evaluation sets must keep pace with these advancements to remain sufficiently discriminative. Item Discrimination (ID) theory, widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs, so that the evaluation set can be continually updated and refined in line with model abilities.
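For context, the classical discrimination index from educational assessment that this work draws on can be sketched as below. This is an illustrative sketch of the textbook formula (D = proportion correct in the top-scoring group minus proportion correct in the bottom-scoring group), not the paper's actual method; the function name and the 27% grouping convention are assumptions for illustration.

```python
def discrimination_index(item_correct, total_scores, group_frac=0.27):
    """Classical item-discrimination index (illustrative sketch).

    item_correct: per-respondent 0/1 results on a single test item.
    total_scores: each respondent's overall test score.
    group_frac: fraction of respondents in the upper/lower groups
                (0.27 is a common textbook convention, assumed here).
    """
    n = len(total_scores)
    k = max(1, int(n * group_frac))
    # Rank respondents by overall score, best first.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    # D near 1: the item separates strong from weak performers well.
    # D near 0: the item does not discriminate at all.
    return p_upper - p_lower
```

An item answered correctly only by high scorers yields D close to 1, while an item everyone answers the same way yields D of 0; the paper's framework generates prompts aiming for high discrimination in this sense.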



Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Zhu, Siyu, Bian, Mouxiao, Xie, Yue, Tang, Yongyu, Yu, Zhikang, Li, Tianbin, Chen, Pengcheng, Han, Bing, Xu, Jie, Dong, Xiaoyan

arXiv.org Artificial Intelligence

With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined by roughly 15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that LLMs in pediatric settings are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.